Question: Using the information provided, can we determine the sex of a penguin without having to take biological samples of a penguin?
Background Information: Penguins are usually a fan favorite at any zoo or aquarium, but the species studied here live in the wild in Antarctica. Researchers have been documenting penguin populations on three islands in Antarctica, recording different measurements and features to keep track of the populations as climate change occurs. It is important to understand the male and female populations in colonies; however, this is difficult because penguins are sexually monomorphic birds, meaning that the two biological sexes are phenotypically indistinguishable from each other. There are two approaches to identifying the sex of a penguin. The traditional methods include cloacal examination, biochemical and cytogenetic analysis, and sound discrimination, which can stress the penguins. The other approach is molecular sex identification, which requires amplification of the chromo-helicase-DNA-binding 1 gene found on the sex chromosomes, a complicated procedure. These processes can be complex and require long hours in the lab. Our model is intended to make sex identification easier and more efficient.
Check out more about Penguin Sex Identification here
Dataset: We will be looking at a dataset of 344 penguins from the Palmer Archipelago (Antarctica).
The variables used in the analysis are:
Note: The culmen is the upper ridge of a bird’s beak
The Chinstrap penguin is the least common species; however, every species is well represented, indicating that there is not a strong imbalance in the data by species.
Overall, the dataset has an even split of male and female penguins. The NA values represent the penguins whose sex could not be identified; the goal of the model is to predict those unknown values for researchers documenting penguins.
Within each species there is also an even split of male and female penguins. The summary statistics also show the distributions of the other variables used to predict the sex.
## Species Island Clutch_Completion Culmen_Length
## Adelie :152 Biscoe :44 0: 14 Min. :32.10
## Chinstrap: 0 Dream :56 1:138 1st Qu.:36.75
## Gentoo : 0 Torgersen:52 Median :38.80
## Mean :38.79
## 3rd Qu.:40.75
## Max. :46.00
## NA's :1
## Culmen_Depth Flipper_Length Body_Mass Sex Delta_15_N
## Min. :15.50 Min. :172 Min. :2850 0 :73 Min. :7.698
## 1st Qu.:17.50 1st Qu.:186 1st Qu.:3350 1 :73 1st Qu.:8.567
## Median :18.40 Median :190 Median :3700 NA's: 6 Median :8.881
## Mean :18.35 Mean :190 Mean :3701 Mean :8.860
## 3rd Qu.:19.00 3rd Qu.:195 3rd Qu.:4000 3rd Qu.:9.153
## Max. :21.50 Max. :210 Max. :4775 Max. :9.795
## NA's :1 NA's :1 NA's :1 NA's :11
## Delta_13_C
## Min. :-26.79
## 1st Qu.:-26.23
## Median :-25.98
## Mean :-25.80
## 3rd Qu.:-25.30
## Max. :-23.90
## NA's :11
## Species Island Clutch_Completion Culmen_Length
## Adelie : 0 Biscoe : 0 0:14 Min. :40.90
## Chinstrap:68 Dream :68 1:54 1st Qu.:46.35
## Gentoo : 0 Torgersen: 0 Median :49.55
## Mean :48.83
## 3rd Qu.:51.08
## Max. :58.00
##
## Culmen_Depth Flipper_Length Body_Mass Sex Delta_15_N
## Min. :16.40 Min. :178.0 Min. :2700 0:34 Min. : 8.472
## 1st Qu.:17.50 1st Qu.:191.0 1st Qu.:3488 1:34 1st Qu.: 9.104
## Median :18.45 Median :196.0 Median :3700 Median : 9.374
## Mean :18.42 Mean :195.8 Mean :3733 Mean : 9.356
## 3rd Qu.:19.40 3rd Qu.:201.0 3rd Qu.:3950 3rd Qu.: 9.620
## Max. :20.80 Max. :212.0 Max. :4800 Max. :10.025
## NA's :1
## Delta_13_C
## Min. :-25.15
## 1st Qu.:-24.69
## Median :-24.57
## Mean :-24.55
## 3rd Qu.:-24.40
## Max. :-23.79
##
## Species Island Clutch_Completion Culmen_Length
## Adelie : 0 Biscoe :124 0: 8 Min. :40.90
## Chinstrap: 0 Dream : 0 1:116 1st Qu.:45.30
## Gentoo :124 Torgersen: 0 Median :47.30
## Mean :47.50
## 3rd Qu.:49.55
## Max. :59.60
## NA's :1
## Culmen_Depth Flipper_Length Body_Mass Sex Delta_15_N
## Min. :13.10 Min. :203.0 Min. :3950 0 :58 Min. :7.632
## 1st Qu.:14.20 1st Qu.:212.0 1st Qu.:4700 1 :61 1st Qu.:8.103
## Median :15.00 Median :216.0 Median :5000 NA's: 5 Median :8.251
## Mean :14.98 Mean :217.2 Mean :5076 Mean :8.245
## 3rd Qu.:15.70 3rd Qu.:221.0 3rd Qu.:5500 3rd Qu.:8.418
## Max. :17.30 Max. :231.0 Max. :6300 Max. :8.834
## NA's :1 NA's :1 NA's :1 NA's :2
## Delta_13_C
## Min. :-27.02
## 1st Qu.:-26.69
## Median :-26.22
## Mean :-26.19
## 3rd Qu.:-25.64
## Max. :-25.00
## NA's :2
Adelie are the only penguins to appear on every island. The Gentoo are found only on Biscoe and the Chinstrap only on Dream. This suggests that there may be something specific about the environment of each island that suits the different species of penguins.
The Random Forest model was used to predict the sex of a penguin solely from its physical features (such as flipper length and body mass) rather than from a biological sample (such as a blood test). The Random Forest model was chosen in an attempt to avoid over-fitting, which is especially important to consider with a smaller dataset such as the penguin dataset. With less data, a model is more likely to pick up “noise” or fluctuations in the training data and consequently learn those as rules or concepts. Unlike a single decision tree, the Random Forest model reduces over-fitting through bootstrapping: it builds many different decision trees, each fit to a different bootstrap sample of the data and considering a random subset of the variables at each split, and combines their votes.
In the case of determining the sex of a penguin, where male is arbitrarily defined as the positive class, false positives and false negatives are about equally undesirable, so we do not need to favor the reduction of one over the other (unlike with cancer data, for example, where false negatives are more detrimental than false positives). Thus, our model aims to reduce both error rates as much as possible in order to give an accurate picture of a given population of penguins, which could then be used in related areas of research such as reproduction rates or sex-based disease studies.
To begin, we split the data into 50% train, 25% tune, and 25% test. However, when we ran the model on this split, it gave unexpected results that we concluded were due to the tune set being so small. For example, the OOB graph suggested that we decrease the number of trees, but when this change was made the model performed worse, even after changing other hyperparameters such as sample size. Thus, we decided to resplit the dataset into 75% train and 25% test, and to use the train dataset to tune the model. This produced much better results.
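As a rough illustration of the 75%/25% split and of the bootstrapping idea described above, the following Python sketch uses only the standard library. The function names, the seed values, and the row count of 324 are assumptions made for the example; they are not taken from the original analysis, which may well have used a stratified split.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.75, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


def bootstrap_sample(train_indices, sample_size, rng):
    """Draw rows with replacement; rows never drawn are 'out of bag' (OOB)."""
    in_bag = [rng.choice(train_indices) for _ in range(sample_size)]
    out_of_bag = sorted(set(train_indices) - set(in_bag))
    return in_bag, out_of_bag


# 324 rows is an illustrative figure (an assumption), not the study's exact count.
train_idx, test_idx = train_test_split_indices(324)

# Each tree in a forest would get its own bootstrap sample like this one.
tree_rng = random.Random(1)
in_bag, out_of_bag = bootstrap_sample(train_idx, sample_size=30, rng=tree_rng)

print(len(train_idx), len(test_idx))
print(len(in_bag), len(out_of_bag))
```

The out-of-bag rows are the ones a given tree never saw, which is why their predictions can serve as a built-in error estimate without a separate tune set.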
According to the confusion matrix, the initial model had a false positive classification error rate of 13.8% and a false negative classification error rate of 9.1%. The true positive rate was therefore 90.9%, while the true negative rate was 86.2%. Thus, the initial model is somewhat better at predicting male penguins than female penguins, but the two error rates are fairly close, which is what we desired.
## 0 1 class.error
## 0 106 17 0.13821138
## 1 11 110 0.09090909
## [1] 0.8844154
The figure shows that the error starts to level out at around 200 trees. To improve our model, we decreased the number of trees and increased the sample size: the number of trees was set to 200, the sample size to 30, and the node size to 3.
In the second model, the false positive classification error rate decreased to 9.8%, which corresponds to a specificity of 90.2%. The true positive rate was 90.1%, which corresponds to a false negative rate of 9.9% and signifies that the model does well in predicting male penguins. Taken together, the evaluation metrics indicate that the second, tuned model predicts female penguins better than the original model and predicts male penguins about as well.
## 0 1 class.error
## 0 106 17 0.13821138
## 1 11 110 0.09090909
## 0 1 class.error
## 0 110 13 0.1056911
## 1 13 108 0.1074380
## [1] "False Positive Rate"
## [1] 0.09756098
## [1] "Specificity"
## [1] 0.902439
## [1] "True Positive Rate"
## [1] 0.9008264
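The class.error column in the two out-of-bag confusion matrices above can be reproduced by hand. The sketch below assumes that rows hold the actual classes and columns hold the predicted classes, which is consistent with the printed values; the function name is made up for this example.

```python
def class_errors(matrix):
    """Per-class error: share of each actual class (row) that was misclassified.

    matrix[i][j] is the number of class-i penguins predicted as class j.
    """
    errors = []
    for i, row in enumerate(matrix):
        correct = row[i]
        errors.append(1 - correct / sum(row))
    return errors


# Counts copied from the two matrices printed above (0 = female, 1 = male).
initial_model = [[106, 17], [11, 110]]
tuned_model = [[110, 13], [13, 108]]

print(class_errors(initial_model))
print(class_errors(tuned_model))
```

With male as the positive class, the first entry of each result is the false positive rate and the second is the false negative rate.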
When the variable importances are plotted, body mass and culmen depth emerge as the two most important factors in determining a penguin’s sex. The higher the mean decrease in accuracy and the mean decrease in Gini, the more important a factor is.
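To give a feel for what a decrease in Gini measures, here is a minimal Python sketch of the impurity drop for a single split. The node counts are hypothetical and chosen only for illustration; in a forest, a variable's mean decrease in Gini is built up from drops like this one, summed over every split that uses the variable and averaged across trees.

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node of 50 females and 50 males, split on a body mass threshold.
parent = [50, 50]
left = [40, 10]   # lighter penguins: mostly female in this made-up example
right = [10, 40]  # heavier penguins: mostly male in this made-up example

print(gini_decrease(parent, left, right))
```

A split that separates the sexes more cleanly produces a larger decrease, so variables that are frequently chosen for such splits accumulate higher importance.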
On the training data, the accuracy of the model was 93.4%, which suggests that the model’s predictions of penguin sex are reliable. Accuracy is a suitable evaluation metric here because the dataset is close to balanced. The sensitivity of the model is 95.0%, meaning that it correctly identifies 95.0% of the male penguins in the dataset. The prevalence, also known as the base rate, of male penguins in the dataset was 49.6%. The detection prevalence, the share of penguins that the model predicts to be male, was 51.2%, which is similar to the base rate.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 113 6
## 1 10 115
##
## Accuracy : 0.9344
## 95% CI : (0.8957, 0.9621)
## No Information Rate : 0.5041
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8689
##
## Mcnemar's Test P-Value : 0.4533
##
## Sensitivity : 0.9504
## Specificity : 0.9187
## Pos Pred Value : 0.9200
## Neg Pred Value : 0.9496
## Precision : 0.9200
## Recall : 0.9504
## F1 : 0.9350
## Prevalence : 0.4959
## Detection Rate : 0.4713
## Detection Prevalence : 0.5123
## Balanced Accuracy : 0.9346
##
## 'Positive' Class : 1
##
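The statistics in the output above all follow from the four cells of the confusion matrix. The Python sketch below recomputes them from the printed counts, with male (class 1) as the positive class; the function name and dictionary keys are made up for this example.

```python
def binary_metrics(tp, fn, fp, tn):
    """Common binary classification metrics from confusion matrix counts."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": precision,
        "neg_pred_value": tn / (tn + fn),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "prevalence": (tp + fn) / total,
        "detection_rate": tp / total,
        "detection_prevalence": (tp + fp) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Counts read from the matrix above: rows are predictions, columns are actual.
train_metrics = binary_metrics(tp=115, fn=6, fp=10, tn=113)

for name, value in train_metrics.items():
    print(f"{name}: {value:.4f}")
```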
On the testing dataset, the model had an accuracy of 92.5%, close to the training result. Both the sensitivity and the specificity on the testing data are 92.5%, which gives us evidence that the model can be trusted to identify both male and female penguins.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 37 3
## 1 3 37
##
## Accuracy : 0.925
## 95% CI : (0.8439, 0.972)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 2.698e-16
##
## Kappa : 0.85
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9250
## Specificity : 0.9250
## Pos Pred Value : 0.9250
## Neg Pred Value : 0.9250
## Precision : 0.9250
## Recall : 0.9250
## F1 : 0.9250
## Prevalence : 0.5000
## Detection Rate : 0.4625
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9250
##
## 'Positive' Class : 1
##
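The Kappa and McNemar's test values reported in the two outputs above can also be recovered from the matrix counts. This Python sketch is an approximation under stated assumptions: it uses the standard Cohen's kappa formula and a chi-squared McNemar test with one degree of freedom, applying the continuity correction only when the two off-diagonal counts differ. The function names are made up for this example.

```python
import math


def cohen_kappa(tp, fn, fp, tn):
    """Agreement between predictions and truth, corrected for chance agreement."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement from the row and column totals of each class.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected)


def mcnemar_p_value(fp, fn):
    """McNemar's test on the two kinds of error (chi-squared, 1 degree of freedom)."""
    if fp == fn:
        return 1.0  # equal error counts: the statistic is zero
    statistic = (abs(fp - fn) - 1) ** 2 / (fp + fn)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))


# Counts read from the training and testing matrices printed above.
print(cohen_kappa(tp=115, fn=6, fp=10, tn=113), mcnemar_p_value(fp=10, fn=6))
print(cohen_kappa(tp=37, fn=3, fp=3, tn=37), mcnemar_p_value(fp=3, fn=3))
```

A large McNemar p-value means there is no evidence that the model makes one kind of error more often than the other, which matches the goal of treating false positives and false negatives as equally undesirable.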
The main question our model was meant to answer was whether we can determine the sex of a penguin without having to take biological samples. The Palmer Archipelago (Antarctica) penguin data included records of 344 penguins, but multiple penguins were missing information about their sex. According to the data, this was because “no blood sample obtained for sexing.” We were able to develop a model that could bypass the need for taking blood samples from penguins to identify their sex, or fill in the gaps when the sex of a penguin could not be identified using traditional methods.
Using the random forest model, we were limited to adjusting the number of trees, the sample size of each tree, mtry, and a few other hyperparameters. Due to the small size of our data, we used many repetitions of building trees and a small sample size for each tree to ensure that the same tree is not built multiple times. The random forest model found that body mass and culmen depth are the most important variables in identifying the sex of a penguin. Surprisingly, the species of the penguin was on the lower side of importance, meaning that the sex of a penguin has less to do with what type of penguin it is and more to do with the build of the penguin.
Since the goal is to identify a penguin’s sex when it could not be identified using blood, there is no specific preference for reducing false negatives or false positives; the aim is simply to reduce both as much as possible. The positive class is male, so a false negative would be identifying a male as a female and a false positive would be identifying a female as a male. When tuning our model, we found that we were able to get the false positive rate to 10.6% and the false negative rate to 8.3%. Because the dataset is so small, it is difficult to tell how this will transfer to penguin sex identification in the real world, but it appears that the model identifies a penguin’s sex correctly over 90% of the time.
After using our test data to evaluate the model, the overall accuracy is 92.5%, with a specificity of 92.5% and a sensitivity of 92.5%. The prevalence of male penguins in the test data was 50%, so an accuracy of 92.5% is well above what guessing a single class would achieve, indicating that the model is performing well. While there is a slight decrease compared to the training data, this can be attributed to the fact that the data we have is so limited, which reduces the robustness of the model. One of our sub-goals was to avoid over-fitting the data, and this can still be improved, but overall the model appears to generalize to penguins outside of the training dataset.
Based on our evaluation, we conclude that the random forest model can be used by researchers to identify the sex of a penguin when the traditional methods of sex identification cannot be performed. As global warming intensifies, penguins are being studied to understand its effects, and sex is an important factor in understanding how penguin populations are changing over time. While taking blood is the most accurate way of identifying the sex of a penguin, we feel that our model performs well enough to be an additional tool in conservation efforts.
The main limitation we faced was the size of the penguin data available. Overall, there were only 344 penguins recorded in the Palmer Archipelago (Antarctica). This is due in part to the fact that it is difficult to find penguins that have not been tagged and recorded yet. It is also due to the fact that the penguin population in Antarctica is decreasing as global warming limits the habitable area for penguins. Ideally, to better understand how to identify the sex of a penguin, more data would be needed to train our model. This data could come from zoos or from historical records, since the penguin populations in the wild are limited. Additionally, it would be interesting to look at how the coloration of a penguin may affect the identification of its sex. Since many bird species look different depending on whether they are male or female, incorporating that information into the model may lead to better identification. Since the long-term goal of the model is to identify a penguin’s sex without having to draw blood, it would also be interesting to see the model’s performance if the Delta 13 C and Delta 15 N blood isotope values were not included in the data.